(57) ABSTRACT

Splintered offloading techniques with receive batch process- 
ing are described for network acceleration. Such techniques 
offload specific functionality to a NIC while maintaining the 
bulk of the protocol processing in the host operating system 
(“OS”). The resulting protocol implementation allows the 
application to bypass the protocol processing of the received 
data. This can be accomplished by moving data from the
NIC directly to the application through direct memory access 
(“DMA”) and batch processing the receive headers in the host 
OS when the host OS is interrupted to perform other work. 
Batch processing receive headers allows the data path to be 
separated from the control path. Unlike operating system 
bypass, however, the operating system still fully manages the 
network resource and has relevant feedback about traffic and 
flows. Embodiments of the present disclosure can therefore 
address the challenges of networks with extreme bandwidth 
delay products (BWDP). 

38 Claims, 10 Drawing Sheets 
NETWORK ACCELERATION TECHNIQUES

RELATED APPLICATIONS 

This application claims the benefit of U.S. Provisional 
Patent Application Ser. No. 61/004,955, entitled “10-100 
Gbps offload NIC for WAN, NLR, Grid computing” filed 3 
Dec. 2007, and also claims the benefit of U.S. Provisional 
Patent Application Ser. No. 61/063,843, entitled “Splintered 
TCP offload engine for grid computing and BDWP” filed 7 
Feb. 2008; the entire contents of both of which applications 
are incorporated herein by reference. 

FEDERALLY SPONSORED RESEARCH OR 
DEVELOPMENT 

This invention was made with government support by (i) 
the National Aeronautics and Space Administration (NASA), 
under contract No. SBIR 06-1-S8.05-8900, and (ii) the 
National Science Foundation, under contract No. STTR 
Grant IIP-0637280. The Government has certain rights in the 
invention. 

BACKGROUND 

The rapid growth of computer networks in the past decade 
has brought, in addition to well known advantages, disloca- 
tions and bottlenecks in utilizing conventional network 
devices. For example, a CPU of a computer connected to a 
network may spend an increasing proportion of its time pro- 
cessing network communications, leaving less time available 
for other work. In particular, file data exchanges between the 
network and a storage unit of the computer, such as a disk 
drive, are performed by dividing the data into packets for 
transportation over the network. Each packet is encapsulated 
in layers of control information that are processed one layer at 
a time by the receiving computer CPU. 

Although the speed of CPUs has constantly increased, this 
type of protocol processing can consume most of the avail- 
able processing power of the fastest commercially available 
CPU. A rough estimation indicates that in a Transmission 
Control Protocol (TCP)/Internet Protocol (IP) network, one 
currently needs one hertz of CPU processing speed to process 
one bit per second of network data. Furthermore, evolving 
technologies such as IP storage, streaming video and audio, 
online content, virtual private networks (VPN) and e-com- 
merce, require data security and privacy like IP Security 
(IPSec), Secure Sockets Layer (SSL) and Transport Layer 
Security (TLS) that increase even more the computing 
demands from the CPU. Thus, the network traffic bottleneck 
has shifted from the physical network to the host CPU. 
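
The one-hertz-per-bit-per-second rule of thumb quoted above can be turned into a quick estimate. The following is an illustrative sketch only; the function name and the 3 GHz core clock are assumptions chosen for the example, not figures from this disclosure.

```python
def cpu_hz_needed(link_bps, hz_per_bps=1.0):
    """CPU clock rate (Hz) implied by the rule of thumb for a given line rate."""
    return link_bps * hz_per_bps

link_bps = 10e9   # a 10 Gbps link
core_hz = 3e9     # hypothetical 3 GHz core (assumption for illustration)

# Number of such cores consumed by protocol processing alone.
cores = cpu_hz_needed(link_bps) / core_hz
print(f"{cores:.2f} cores at 3 GHz for protocol processing alone")
```

Under this estimate a single 10 Gbps link would occupy more than three such cores, which is the sense in which the bottleneck has moved to the host CPU.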

Most network computer communication is accomplished 
with the aid of layered software architecture for moving infor- 
mation between host computers connected to the network. 
The general functions of each layer are normally based on an 
international standard defined by the International Standards 
Organization (ISO), named the Open Systems Interconnec- 
tion (OSI) network model. The OSI model sets forth seven 
processing layers through which information received by a 
host passes and is made presentable to an end user. Similarly,
those seven processing layers may be passed in reverse order 
during transmission of information from a host to the net- 
work. 

It is well known that networks may include, for instance, a 
high-speed bus such as an Ethernet connection or an internet 
connection between disparate local area networks (LANs), 
each of which includes multiple hosts or any of a variety of 
other known means for data transfer between hosts. According
to the OSI standard, Physical layers are connected to the
network at respective hosts, providing transmission and
receipt of raw data bits via the network. A Data Link layer is
serviced by the Physical layer of each host, the Data Link
layers providing frame division and error correction to the
data received from the Physical layers, as well as processing
acknowledgment frames sent by the receiving host. A Network
layer of each host, used primarily for controlling size
and coordination of subnets of packets of data, is serviced by
respective Data Link layers. A Transport layer is serviced by
each Network layer, and a Session layer is serviced by each
Transport layer within each host. Transport layers accept data
from their respective Session layers, and split the data into
smaller units for transmission to Transport layers of other
hosts, each such Transport layer concatenating the data for
presentation to respective Presentation layers. Session layers
allow for enhanced communication control between the
hosts. Presentation layers are serviced by their respective
Session layers, the Presentation layers translating between
data semantics and syntax which may be peculiar to each host
and standardized structures of data representation. Compression
and/or encryption of data may also be accomplished at
the Presentation level. Application layers are serviced by
respective Presentation layers, the Application layers translating
between programs particular to individual hosts and
standardized programs for presentation to either an application
or an end user.

The rules and conventions for each layer are called the
protocol of that layer, and since the protocols and general
functions of each layer are roughly equivalent in various
hosts, it is useful to think of communication occurring
directly between identical layers of different hosts, even
though these peer layers do not directly communicate without
information transferring sequentially through each layer
below. Each lower layer performs a service for the layer
immediately above it to help with processing the communicated
information. Each layer saves the information for processing
and service to the next layer. Due to the multiplicity of
hardware and software architectures, devices, and programs
commonly employed, each layer is necessary to ensure that
the data can make it to the intended destination in the appropriate
form, regardless of variations in hardware and software
that may intervene.

In preparing data for transmission from a first to a second
host, some control data is added at each layer of the first host
regarding the protocol of that layer, the control data being
indistinguishable from the original (payload) data for all
lower layers of that host. Thus an Application layer attaches
an application header to the payload data, and sends the
combined data to the Presentation layer of the sending host,
which receives the combined data, operates on it, and adds a
presentation header to the data, resulting in another combined
data packet. The data resulting from combination of payload
data, application header and presentation header is then
passed to the Session layer, which performs required operations
including attaching a session header to the data, and
presenting the resulting combination of data to the transport
layer. This process continues as the information moves to
lower layers, with a transport header, network header and data
link header and trailer attached to the data at each of those
layers, with each step typically including data moving and
copying, before sending the data as bit packets, over the
network, to the second host.

The receiving host generally performs the reverse of the
above-described process, beginning with receiving the bits
from the network, as headers are removed and data processed
in order from the lowest (Physical) layer to the highest (Application)
layer before transmission to a destination of the
receiving host. Each layer of the receiving host recognizes
and manipulates only the headers associated with that layer,
since, for that layer, the higher layer control data is included
with and indistinguishable from the payload data. Multiple
interrupts, valuable CPU processing time and repeated data
copies may also be necessary for the receiving host to place
the data in an appropriate form at its intended destination.
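
The layer-by-layer encapsulation on the sending host, and its reversal on the receiving host, can be sketched as follows. This is a toy model under stated assumptions: the bracketed text headers and the function names are hypothetical, and the data link trailer and the Physical layer are omitted for brevity.

```python
# Layers in the order in which a sending host adds their headers.
LAYERS = ["application", "presentation", "session",
          "transport", "network", "data_link"]

def encapsulate(payload: bytes) -> bytes:
    """Sending host: each layer prepends its own header to what it was given."""
    data = payload
    for layer in LAYERS:
        data = f"[{layer}]".encode() + data
    return data

def decapsulate(frame: bytes) -> bytes:
    """Receiving host: strip headers from the lowest layer to the highest."""
    data = frame
    for layer in reversed(LAYERS):
        header = f"[{layer}]".encode()
        if not data.startswith(header):
            raise ValueError(f"missing {layer} header")
        # To this layer, everything after its own header is opaque payload.
        data = data[len(header):]
    return data

frame = encapsulate(b"hello")
restored = decapsulate(frame)
```

Each pass through the loop corresponds to one of the data moves or copies noted above, which is where much of the per-packet host cost arises.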

As networks grow increasingly popular and the informa- 
tion communicated thereby becomes increasingly complex 
and copious, the need for such protocol processing has 
increased. It is estimated that a large fraction of the processing 
power of a host CPU may be devoted to controlling protocol 
processes, diminishing the ability of that CPU to perform 
other tasks. Network interface cards (NICs) have been devel- 
oped to help with the lowest layers, such as the Physical and 
Data Link layers. It is also possible to increase protocol 
processing speed by simply adding more processing power or 
CPUs according to conventional arrangements. This solution, 
however, is both awkward and expensive. The complexities 
presented by various networks, protocols, architectures, oper- 
ating devices and applications generally require extensive 
processing to afford communication capability between vari- 
ous network hosts. 

The TCP/IP model is a specification for computer network 
protocols created in the 1970s by DARPA, an agency of the 
United States Department of Defense. It laid the foundations 
for ARPANET, which was the world’s first wide area network 
and a predecessor of the Internet. The TCP/IP Model is some- 
times called the Internet Reference Model, the DoD Model or 
the ARPANET Reference Model. 

TCP/IP is generally described as having four abstraction 
layers (RFC 1122), e.g., as shown in the box below: 


Application
Transport (TCP or UDP)
Internet (IP)
Link


This layer view is often compared with the seven-layer OSI 
Reference Model formalized after the TCP/IP specifications. 

Regarding the layers in the TCP/IP model, the layers near 
the top are logically closer to the user application, while those 
near the bottom are logically closer to the physical transmis- 
sion of the data. Viewing layers as providing or consuming a 
service is a method of abstraction to isolate upper layer pro- 
tocols from the nitty-gritty detail of transmitting bits over, for 
example, Ethernet and collision detection, while the lower 
layers avoid having to know the details of each and every 
application and its protocol. This abstraction also allows 
upper layers to provide services that the lower layers cannot, 
or choose not to, provide. Again, the original OSI Reference 
Model was extended to include connectionless services 
(OSIRM CL). For example, IP is not designed to be reliable 
and is a best effort delivery protocol. This means that all 
transport layer implementations must choose whether or not 
to provide reliability and to what degree. UDP provides data 
integrity (via a checksum) but does not guarantee delivery; 
TCP provides both data integrity and delivery guarantee (by 
retransmitting until the receiver acknowledges the reception 
of the packet). 
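
The integrity check mentioned above is the 16-bit ones' complement checksum shared by IP, UDP and TCP (described in RFC 1071). A minimal sketch of the core computation follows; the function name is an assumption, and the pseudo-header that UDP and TCP also cover is omitted.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement of the ones' complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                    # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                     # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

sample = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(sample)

# A receiver recomputes over data plus checksum; a result of 0 means no error found.
residual = internet_checksum(sample + checksum.to_bytes(2, "big"))
```

A checksum of this kind detects corruption but does nothing about loss, which is why delivery guarantees require the retransmission behavior that TCP adds.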

The following is a description of each layer in the TCP/IP 
networking model starting from the lowest level. The Link 
Layer is the networking scope of the local network connection
to which a host is attached. This regime is called the link
in Internet literature. This is the lowest component layer of the
Internet protocols, as TCP/IP is designed to be hardware
independent. As a result TCP/IP has been implemented on top
of virtually any hardware networking technology in existence.
The Link Layer is used to move packets between the
Internet Layer interfaces of two different hosts on the same
link. The processes of transmitting packets on a given link and
receiving packets from a link can be controlled both in the
software device driver for the network card, as well as on
firmware or specialist chipsets. These will perform data link
functions such as adding a packet header to prepare it for
transmission, then actually transmit the frame over a physical
medium. The TCP/IP model includes specifications of translating
the network addressing methods used in the Internet
Protocol to data link addressing, such as Media Access Control
(MAC), however all other aspects below that level are
implicitly assumed to exist in the Link Layer, but are not
explicitly defined. The Link Layer can also be the layer where
packets are intercepted to be sent over a virtual private network
or other networking tunnel. When this is done, the Link
Layer data is considered as application data and proceeds
back down the IP stack for actual transmission. On the receiving
end, the data goes up through the IP stack twice (once for
routing and the second time for the tunneling function). In
these cases a transport protocol or even an application scope
protocol constitutes a virtual link placing the tunneling protocol
in the Link Layer of the protocol stack. Thus, the TCP/IP
model does not dictate a strict hierarchical encapsulation
sequence and the description is dependent upon actual use
and implementation.

Internet Layer: As originally defined, the Internet layer (or
Network Layer) solves the problem of getting packets across
a single network. Examples of such protocols are X.25, and
the ARPANET’s Host/IMP Protocol. With the advent of the
concept of internetworking, additional functionality was
added to this layer, namely getting data from the source
network to the destination network. This generally involves
routing the packet across a network of networks, known as an
internetwork or internet (lower case). In the Internet Protocol
Suite, IP performs the basic task of getting packets of data
from source to destination. IP can carry data for a number of
different upper layer protocols. These protocols are each
identified by a unique protocol number: ICMP and IGMP are
protocols 1 and 2, respectively. Some of the protocols carried
by IP, such as ICMP (used to transmit diagnostic information
about IP transmission) and IGMP (used to manage IP Multicast
data) are layered on top of IP but perform internetwork
layer functions. This illustrates an incompatibility between
the Internet and the IP stack and OSI model. Some routing
protocols, such as OSPF, are also part of the network layer.

Transport Layer: The Transport Layer’s responsibilities
include end-to-end message transfer capabilities independent
of the underlying network, along with error control, fragmentation
and flow control. End to end message transmission or
connecting applications at the transport layer can be categorized
as either: connection-oriented e.g. TCP, or connectionless
e.g. UDP. The Transport Layer can be thought of literally
as a transport mechanism e.g. a vehicle whose responsibility
is to make sure that its contents (passengers/goods) reach its
destination safely and soundly, unless a higher or lower layer
is responsible for safe delivery. The Transport Layer provides
this service of connecting applications together through the
use of ports. Since IP provides only a best effort delivery, the
Transport Layer is the first layer of the TCP/IP stack to offer
reliability. Note that IP can run over a reliable data link
protocol such as the High-Level Data Link Control (HDLC).
Protocols above transport, such as RPC, also can provide
reliability. For example, TCP is a connection-oriented protocol
that addresses numerous reliability issues to provide a
reliable byte stream: data arrives in-order; data has minimal
error (i.e., correctness); duplicate data is discarded; lost/discarded
packets are re-sent; and, includes traffic congestion
control. The newer SCTP is also a “reliable”, connection-oriented,
transport mechanism. It is Message-stream-oriented,
not byte-stream-oriented like TCP, and provides multiple
streams multiplexed over a single connection. It also
provides multi-homing support, in which a connection end
can be represented by multiple IP addresses (representing
multiple physical interfaces), such that if one fails, the connection
is not interrupted. It was developed initially for telephony
applications (to transport SS7 over IP), but can also be
used for other applications. UDP is a connectionless datagram
protocol. Like IP, it is a best effort or “unreliable”
protocol. Reliability is addressed through error detection
using a weak checksum algorithm. UDP is typically used for
applications such as streaming media (audio, video, Voice
over IP etc) where on-time arrival is more important than
reliability, or for simple query/response applications like
DNS lookups, where the overhead of setting up a reliable
connection is disproportionately large. RTP is a datagram
protocol that is designed for real-time data such as streaming
audio and video. TCP and UDP are used to carry an assortment
of higher-level applications. The appropriate transport
protocol is chosen based on the higher-layer protocol application.
For example, the File Transfer Protocol expects a
reliable connection, but the Network File System assumes
that the subordinate Remote Procedure Call protocol, not
transport, will guarantee reliable transfer. Other applications,
such as VoIP, can tolerate some loss of packets, but not the
reordering or delay that could be caused by retransmission.
The applications at any given network address are distinguished
by their TCP or UDP port. By convention certain well
known ports are associated with specific applications. (See
List of TCP and UDP port numbers.)

Application Layer: The Application Layer refers to the
higher-level protocols used by most applications for network
communication. Examples of application layer protocols
include the File Transfer Protocol (FTP) and the Simple Mail
Transfer Protocol (SMTP). Data coded according to application
layer protocols are then encapsulated into one or (occasionally)
more transport layer protocols (such as the Transmission
Control Protocol (TCP) or User Datagram Protocol
(UDP)), which in turn use lower layer protocols to effect
actual data transfer. Since the IP stack defines no layers
between the application and transport layers, the application
layer must include any protocols that act like the OSI’s presentation
and session layer protocols. This is usually done
through libraries. Application Layer protocols generally treat
the transport layer (and lower) protocols as “black boxes” that
provide a stable network connection across which to communicate,
although the applications are usually aware of key
qualities of the transport layer connection such as the end
point IP addresses and port numbers. As noted above, layers
are not necessarily clearly defined in the Internet protocol
suite. Application layer protocols are most often associated
with client-server applications, and the commoner servers
have specific ports assigned to them by the IANA: HTTP has
port 80; Telnet has port 23; etc. Clients, on the other hand,
tend to use ephemeral ports, i.e. port numbers assigned at
random from a range set aside for the purpose. Transport and
lower level layers are largely unconcerned with the specifics
of application layer protocols. Routers and switches do not
typically “look inside” the encapsulated traffic to see what
kind of application protocol it represents, rather they just
provide a conduit for it. However, some firewall and bandwidth
throttling applications do try to determine what’s
inside, as with the Resource Reservation Protocol (RSVP).
It’s also sometimes necessary for Network Address Translation
(NAT) facilities to take account of the needs of particular
application layer protocols. (NAT allows hosts on private
networks to communicate with the outside world via a single
visible IP address using port forwarding, and is an almost
ubiquitous feature of modern domestic broadband routers).

Hardware and software implementation: Normally, appli- 
cation programmers are concerned only with interfaces in the 
Application Layer and often also in the Transport Layer, 
while the layers below are services provided by the TCP/IP 
stack in the operating system. Microcontroller firmware in the 
network adapter typically handles link issues, supported by 
driver software in the operating system. Non-programmable
analog and digital electronics are normally in charge of
the physical components in the Link Layer, typically using an 
application-specific integrated circuit (ASIC) chipset for 
each network interface or other physical standard. Hardware 
or software implementation is, however, not stated in the 
protocols or the layered reference model. High-performance 
routers are to a large extent based on fast non-programmable 
digital electronics, carrying out link level switching. 

Network bandwidth is increasing faster than the ability of host
processors to process traditional protocols. Interrupt pressure has
been the bottleneck for TCP/IP over increasing network
bandwidths. The solutions that have generally been proposed
to alleviate this bottleneck are interrupt coalescing and
net-polling, jumbo frames, and TCP offload. Interrupt coalescing
and jumbo frames are becoming standards in high-performance
networking. However, neither of them delivers a large
enough impact at 10 Gbps network speeds and beyond. Several
factors have made full TCP offload a less attractive alternative.
Full TCP offload requires that all protocol processing
be handled by the NIC. This requires a very sophisticated NIC
with a great deal of memory for buffering purposes. Such NICs are,
therefore, cost-prohibitive. Additionally, the memory and
processing required make full TCP offload scale poorly. Full
TCP processing on the NIC also moves control of the network 
resource away from the operating system. This fundamen- 
tally erodes the security of the host since the OS does not have 
full control of what is entering the memory space or the 
protocol stack space. Also, the OS has difficulty making 
dynamic policy decisions based on potential attacks or 
changes in network traffic. TCP Data Path Offload, in which 
the flows are created by the OS, but the protocol processing 
associated with data movement is offloaded, addresses the 
first issue, but cannot address the second issue since informa- 
tion about the status of the network is not routinely shared 
with the OS during the flow of data. What is desired, there- 
fore, are improved techniques that can allow for quicker data 
transfer and can address the needs of networks having rela- 
tively high bandwidth delay products. 
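
For a sense of scale, the bandwidth delay product is the link rate multiplied by the round-trip time, and it gives the amount of data that must be in flight to keep the link full. The sketch below uses integer arithmetic; the function name and the round-trip times are assumptions chosen for illustration, not measurements from this disclosure.

```python
def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth delay product in bytes: bits/s * seconds / 8."""
    return bandwidth_bps * rtt_ms // 8000   # /1000 for ms, /8 for bits to bytes

# Hypothetical long-haul paths (round-trip times are assumptions).
cross_country = bdp_bytes(10_000_000_000, 100)       # 10 Gbps, 100 ms RTT
intercontinental = bdp_bytes(100_000_000_000, 200)   # 100 Gbps, 200 ms RTT

print(f"10 Gbps x 100 ms  -> {cross_country / 1e6:.0f} MB in flight")
print(f"100 Gbps x 200 ms -> {intercontinental / 1e9:.1f} GB in flight")
```

Buffering on this order at the endpoint is part of what makes holding all protocol state and data on the NIC costly.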

SUMMARY 

The present disclosure is directed to techniques, including 
methods and architectures, for the acceleration of file trans- 
fers over networks. Such techniques can provide for the split- 
ting or “splintering” of packet headers and related files/data 
during offloading processes. 

An aspect of the present disclosure provides engine sys- 
tems utilizing splintered offload logic. Such engines can 
include or be implemented with one or more physical inter- 
faces, media access controllers (“MAC”s), and backplane 
interfaces. Such engines (or portions of such) can be incorporated
into NIC circuits including single or multiple components,
e.g., field programmable gate arrays (“FPGA”s),
application specific integrated circuits (“ASIC”s), and the
like.

Another aspect of the present disclosure provides systems 
that are based upon unique coding and architecture derived 
from splintered UDP offload technology, resulting in unique 
FPGA core architectures and firmware (e.g., offload engines). 

Embodiments of a novel offload engine according to the
present disclosure include a NIC architecture with network
connections at 10 Gbps, scaling in n×10 Gbps increments.

One skilled in the art will appreciate that embodiments of 
the present disclosure can be implemented in hardware, software,
firmware, or any combinations of such, and over one or
more networks.

Other features and advantages of the present disclosure 
will be understood upon reading and understanding the 
detailed description of exemplary embodiments, described 
herein, in conjunction with reference to the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

Aspects of the disclosure may be more fully understood 
from the following description when read together with the
accompanying drawings, which are to be regarded as illustra- 
tive in nature, and not as limiting. The drawings are not 
necessarily to scale, emphasis instead being placed on the 
principles of the disclosure. In the drawings: 

FIG. 1 depicts a diagrammatic view of a path of a splintered
packet (e.g., TCP) through a splintered stack architecture, in
accordance with exemplary embodiments of the present disclosure;

FIG. 2 depicts a diagrammatic view of a NIC circuit architecture
in accordance with an exemplary embodiment of the
present disclosure;

FIG. 3 depicts a diagrammatic view of a splintered offload
engine in accordance with an exemplary embodiment of the
present disclosure;

FIG. 4 depicts an enlarged view of a portion of FIG. 3
showing a packet receive process and architecture in accordance
with an exemplary embodiment of the present disclosure;

FIG. 5 depicts a diagrammatic view of an alternate packet
transmit process and architecture in accordance with a further
embodiment of the present disclosure;

FIG. 6 depicts a diagrammatic representation of 40 Gbps
bandwidth and 60 Gbps bandwidth embodiments of the
present disclosure;

FIG. 7 depicts a diagrammatic view of an extensible message
oriented offload model (“EMO”) for a receive process, in
accordance with an embodiment of the present disclosure;

FIG. 8 depicts a method of packet splintering in accordance
with exemplary embodiments;

FIG. 9 depicts a packet processing method in accordance
with an embodiment of the present disclosure;

FIG. 10 depicts a further embodiment of a packet processing
method, in accordance with the present disclosure;

FIG. 11 depicts a further embodiment of a packet processing
method, in accordance with the present disclosure;

FIG. 12 depicts a processing method, in accordance with
the present disclosure; and

FIG. 13 depicts a further embodiment of a method in accordance
with the present disclosure.

While certain embodiments are depicted in the drawings,
one skilled in the art will appreciate that the embodiments
depicted are illustrative and that variations of those shown, as
well as other embodiments described herein, may be envisioned
and practiced within the scope of the present disclosure.

DETAILED DESCRIPTION 

Aspects of the present disclosure generally relate to tech- 
niques utilizing novel offload engines based on the architec- 
tures implementing splinter offload (or “splintering”) logic. 
Such techniques split off packet data from associated packet 
header (or descriptor) information. Some variations of splin- 
tered offload include/address IP headers where others include 
TCP headers, and other could include both. Each header has 
many parameters. Common vocabulary or terminology in 
both types of headers (IP and TCP) include: source, destina- 
tion, and/or checksum — priority or urgency. Such architec- 
tures can be based on low-cost, high-performance FPGA 
subsystems. Using network simulations and modeling, 
embodiments have been verified, e.g., as system feasibility 
for bandwidths from 10-100+ Gbps. The offload engine sys- 
tem can allow access to distributed and shared data over 10 
Gbps and beyond, for various networks. Such techniques can 
run/implement splintered UDP or TCP on our system up to 
100 Gbps. System can accordingly be compatible 10 GigE 
networking infrastructure, allow for bandwidth scalability. 
As faster versions of the busses become available, e.g., PCI 
express bus, embodiments of the present disclosure can pro- 
vide splintered TCP and UDP operation at higher rates, e.g., 
128 Gbps to 1,000+ Gbps f-d for Terabit Ethernet applica- 
tions. 

Splintered offloading techniques (TCP or UDP) with
receive batch processing address most of the issues associated
with TCP offload, at a significantly reduced manufacturing
cost, by offloading specific functionality to the NIC while
maintaining the bulk of the protocol processing in the host
OS. This is the core of splintered offloading according to the
present disclosure. The resulting protocol implementation
allows the application to bypass the protocol processing of the 
received data. This can be accomplished by moving data
from the NIC directly to the application through DMA and 
batch processing the receive headers in the host OS when the 
host OS is interrupted to perform other work. Batch process- 
ing receive headers allows the data path to be separated from 
the control path. Unlike operating system bypass, however, 
the operating system still fully manages the network resource 
and has relevant feedback about traffic and flows. Embodi- 
ments of the present disclosure can therefore address the 
challenges of networks with extreme bandwidth delay products
(BWDP). Example facilities include 10-100 Gbps intracontinental
and intercontinental links at national labs and
aerospace firms. Bulk data transfers in such networks need to
be provided with the endpoint resources required to ensure high
performance in a cost-effective manner.
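
The separation of data path and control path described above can be illustrated with a small software model. This is a sketch under stated assumptions, not the disclosed hardware design: the 8-byte header layout (flow id, byte offset), the class name, and the method names are hypothetical, and a slice assignment stands in for the DMA write into the pre-posted application buffer.

```python
import struct

# Hypothetical 8-byte header for illustration only: flow id, byte offset.
HEADER = struct.Struct("!II")

class SplinteredReceiver:
    def __init__(self, buffer_size):
        self.app_buffer = bytearray(buffer_size)  # models the pre-posted, pinned buffer
        self.pending_headers = []                 # headers queued for the host OS
        self.bytes_accounted = 0                  # protocol state kept by the host OS

    def on_packet(self, packet):
        """Data path ('NIC' side): split the header off and place the payload."""
        flow_id, offset = HEADER.unpack_from(packet)
        payload = packet[HEADER.size:]
        end = offset + len(payload)
        if end > len(self.app_buffer):
            raise ValueError("payload does not fit in the pre-posted buffer")
        self.app_buffer[offset:end] = payload     # stands in for the DMA write
        self.pending_headers.append((flow_id, offset, len(payload)))

    def on_interrupt(self):
        """Control path ('OS' side): process every queued header in one batch."""
        batch, self.pending_headers = self.pending_headers, []
        for _flow_id, _offset, length in batch:
            self.bytes_accounted += length
        return len(batch)

rx = SplinteredReceiver(10)
rx.on_packet(HEADER.pack(1, 5) + b"world")   # arrives out of order
rx.on_packet(HEADER.pack(1, 0) + b"hello")
handled = rx.on_interrupt()                  # both headers handled together
```

In this model the payload reaches the application buffer without waiting for header processing, while the host still sees every header and therefore keeps control of flow state.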

The present inventors have conducted research demonstrating
feasibility at multiples of 10 Gigabits per second (Gbps), through
100 Gbps and higher speeds (e.g., indicated by “n×10 Gbps” in some of
the figures). In some implementations the design can fit
into one chip or one piece of code. In other implementations,
multiple chips or pieces of code would be used. Embodiments of the
present disclosure can be implemented at or extended to 1000 Gigabits
per second.

Aspects of the present disclosure can provide and/or incorporate
algorithms for the following: (i) IP (or TCP) offload
transmit, and receive; (ii) TCP (or IP) checksum on a FPGA;
(iii) separation of packet headers from data; (iv) packet
de-multiplexing for pre-posted read; (v) support for out of order
packet reception; (vi) supporting memory, accompanied
(e.g., Verilog) subsystems as needed; and/or, (vii) supporting
DMA engines. The algorithms can each be translated into
block diagrams to be used for writing, e.g., Verilog code.

As a preliminary matter, the following definitions are used 
herein: quantum: amount of time assigned to a job. Quantum 
expiry: the time can expire in which case the priority of the job 
may be changed. Job: a program, file, or a unit of work. 
Header processing: the “utilization of” or “calculation using”
header parameters. Moreover, the term “storage location” can 
include reference to one or more buffers and/or permanent 
memory, e.g., in a local device/system or a distributed system 
such as over the Internet. 

Splintered TCP with Protocol Bypass 

FIG. 1 depicts a diagrammatic view of a path of a splintered 
packet (e.g., TCP) through a splintered stack architecture of 
an offload engine 100, in accordance with exemplary embodi- 
ments of the present disclosure. As shown, the engine can 
include a physical device 110, e.g., a NIC or network interface
circuit, interfacing with an operating system 120 and a soft- 
ware application 130. The NIC 110 can include a descriptor 
(or header) table 112. The operating system 120 can be asso- 
ciated or linked with (or connected to) host memory 122 and 
configured and arranged to perform a page pinning 124 to the 
memory 122. The application can include a receive buffer 
132. As used herein, “linked,” “connected” and “coupled” can 
have the same meaning; also, while a physical device is 
referenced as interfacing with a network, suitably functioning 
software or firmware can also or in substitution be used. 

FIG. 1 shows the path of a splintered packet (e.g., a TCP 
packet) through the architecture 100, which may be referred 
to as a “Splintered TCP” stack. The management/production 
of a Splintered TCP is designed to keep TCP flow manage- 
ment and network resource management with the operating 
system (OS) while moving data quickly and directly from the 
network interface card (NIC) to the application. Splintered
TCP preferably requires that the application that is to receive
data pre-post a receive to the operating system. The operating
system can lock the page of memory associated with the 
storage location (e.g., buffer, permanent memory, or the like) 
in application-space that will receive the data. Then the oper- 
ating system creates a very small receive descriptor and 
informs the physical device (e.g., NIC) that a receive is being 
pre-posted. As policy, the operating system can choose to 
create a timeout value for the pre-posted receive so that if no 
data is received in the buffer within a certain amount of time, 
the receive is invalidated and the memory is unlocked. When 
the OS informs the physical device (e.g., NIC) of the pre- 
posted receive, a copy of the receive descriptor is added to the 
NIC’s pre-posted receive table. When a message arrives, the
physical device simply checks against the table by using a
standard hash (e.g., MD5) of the source IP, source port,
destination IP and destination port. If the data is part of a 
pre-posted receive, the data is sent (or DMA’d) to the appro- 
priate offset in the application memory space. The headers are 
DMA’d to the host OS in a circular queue. When the host OS 
is interrupted for other work or on quantum expiry, the OS 
processes the headers in the receive queue. 
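
By way of illustration only, the pre-posted receive path described above can be sketched in software. The sketch below is a toy model, not the disclosed hardware: the class name `SplinteredNic`, the helper `flow_key`, the use of a bounded `deque` as the circular header queue, and the policy of treating an overfull buffer like unlisted traffic are all assumptions made for the example. MD5 is used because it is the example hash named in the text.

```python
import hashlib
from collections import deque
from dataclasses import dataclass, field


def flow_key(src_ip, src_port, dst_ip, dst_port):
    """Hash the connection 4-tuple (MD5 is the example hash named in the text)."""
    raw = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    return hashlib.md5(raw).hexdigest()


@dataclass
class SplinteredNic:
    """Toy model of the NIC's pre-posted receive table (hypothetical class)."""
    table: dict = field(default_factory=dict)      # flow key -> [buffer, next offset]
    header_queue: deque = field(default_factory=lambda: deque(maxlen=1024))
    unlisted: list = field(default_factory=list)   # traffic left to the normal stack

    def pre_post(self, four_tuple, app_buffer):
        # The OS is assumed to have pinned app_buffer before registering it here.
        self.table[flow_key(*four_tuple)] = [app_buffer, 0]

    def receive(self, four_tuple, header, payload):
        entry = self.table.get(flow_key(*four_tuple))
        # Not pre-posted, or no room left (assumed policy): use the traditional path.
        if entry is None or entry[1] + len(payload) > len(entry[0]):
            self.unlisted.append((header, payload))
            return False
        buf, offset = entry
        buf[offset:offset + len(payload)] = payload   # stands in for DMA to the app
        entry[1] = offset + len(payload)
        self.header_queue.append(header)              # header is queued for the OS
        return True
```

In this sketch the payload never passes through the protocol stack; only the queued headers remain for the operating system to process later in a batch.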

It is important to note that normal traffic is sent to the 
operating system in the traditional manner. This allows Splin- 
tered TCP to use the normal TCP/IP stack in the operating 
system on the host to do, as it should, all error-detection and 
error-correction. 

FIG. 2 depicts a diagrammatic view of a NIC circuit sys- 
tem/architecture 200 in accordance with an exemplary 
embodiment of the present disclosure. The architecture can 
provide splintered offload of packets at 64 Gigabits per second
(“Gbps”), e.g., the current practical limit of PCI Express
x16 Gen II (PCIe-x16 Gen II), and scalability to 100 Gbps
full-duplex (f-d). Systems incorporating architecture 200 can
accordingly provide splintered packet (UDP, TCP, IP) offload
technology, resulting in a unique FPGA core and firmware
architecture.

The offload engine system 200 allows access to distributed
and shared data over networks at 10 Gbps and beyond. Such
systems can run splintered UDP or TCP at up to 100+ Gbps for
various applications. Systems can be compatible with 10 GigE
networking infrastructure and allow for bandwidth scalability.

Because of the inherent limitations in the TCP protocol and
to facilitate scaling to 100+ Gbps f-d, the UDT variant of UDP
can be used. Commercial applications of embodiments of the
present disclosure can include core IP to be marketed to
FPGA manufacturers, core IP distributors, offload engine
manufacturers, and motherboard and systems manufacturers
who require offload engine system-on-chips for their motherboards.
Such applications can also provide an entire offload engine NIC,
both hardware and firmware, to the motherboard and systems
manufacturers of cluster and Grid computing products.
Embodiments can differ from market solutions by providing a
10-100 Gbps splintered TCP/IP/UDP acceleration engine that is
compatible with present networking infrastructure for Grid
computing, while providing for future bandwidth scalability.

FPGA Core

FIG. 3 depicts a diagrammatic view of a splintered offload
engine 300 in accordance with an exemplary embodiment of
the present disclosure. As shown, architecture 300 can utilize
a PCIe-x16 Gen II bus in a 64 Gbps offload configuration.

Referring to the offload system-on-chip architecture of FIG. 3,
the receive side of the offload engine composition, which makes
up the FPGA I.P., is now discussed. One MD5
encoder output is matched against one descriptor. There are
six descriptors, hence six encoders. This is for one 10 Gbps
path. There are six such paths, but the descriptor table is the
same for all. This allows for six packets to simultaneously be
checked against the descriptor table. There are six packet
paths for 60 Gbps total. Instead of MD5, other types of hash,
for example but not limited to SHA-1, have been proven to be
feasible; others may be used as well.

When the incoming packet reaches the next-to-last stages of
the packet FIFO, the encoding checks for a match within the
buffer pool (descriptor table). If there is a match, the packet
then exits the FIFO, and at the same rate, the packet is transferred
to the listed packet buffer. When the complete packet is
transferred, the DMA engine transfers the packet from the
listed packet buffer to the Altera Atlantic I/F, for output to the
host over the PCIe-x16 Gen II bus (64 Gbps f-d). The Atlantic
interface is Altera’s standard, generic bus when connecting to
high-speed data interfaces. The Atlantic interface is one
example, and examples of other suitable interfaces can
include, but are not limited to, SPI-4.2 or later versions, FIFO
interfaces, or generic User Space to PCI Express interfaces.

For both listed and unlisted packet buffers, the data is
written in at 622 MHz. Either the listed packet buffer or
unlisted packet buffer is write enabled and written at 622
MHz. Since the pipeline and buffers are 16 bits wide, this
corresponds to 10 Gbps for either path. The DMA engine
output is at the same rate, transferring either listed or unlisted
packets to the PCIe-x16 Gen II bus. The design is scalable to
later or subsequent versions of the PCI Express bus or other
host interfaces. The Altera GX130 FPGAs are equipped with
programmable equalization to compensate for signal degradation
during transmission. This enables ultra-high-speed
interfaces such as the PCIe signals and Altera Atlantic interface.
In normal operation, the DMA engine transfers data out
in the same order it came in; control logic selects between
the listed and unlisted packet buffers. The order may be overridden,
may be changed to reclaim mode (unlisted packets), or may
use tagged command queuing, depending on how the host writes
to the control registers. 
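
The relationship stated above between the 622 MHz write clock, the 16-bit data width, and the nominal 10 Gbps path rate can be checked with a short calculation. The function name `line_rate_gbps` is introduced here for illustration only; the clock, width, and six-path count are taken from the text.

```python
def line_rate_gbps(clock_mhz: float, width_bits: int) -> float:
    """Raw data rate of a pipeline that moves width_bits on every clock edge."""
    return clock_mhz * 1e6 * width_bits / 1e9


# One 16-bit path clocked at 622 MHz, as described for the packet buffers.
per_path = line_rate_gbps(622, 16)

# Six such paths in parallel, matching the six-path configuration of FIG. 3.
aggregate = 6 * per_path
```

The per-path figure comes to 9.952 Gbps, which the text rounds to 10 Gbps, and six paths give slightly under the nominal 60 Gbps.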

With continued reference to FIG. 3, the Atlantic interface
(I/F) is Altera’s standard, generic bus when connecting to
high-speed data interfaces. One Atlantic I/F is write enabled
at a time. After the block is filled with a 32 kByte packet, the
next Atlantic I/F is write enabled. There are a total of six 10
Gbps paths, for 60 Gbps.

While being applicable to TCP/IP, system 300 is also applicable
to UDP. Splintered UDP, however, may be more
involved. The only dependency that arises when more cores
are added is contention for the shared resources (the MAC
engine and the DMA engine). An assumption may be made
that the application will poll for completion of a message.

The descriptor can contain one or more of nine fields:
SRC_IP, SRC_PORT, DST_IP, DST_PORT, BUFFER_ADDRESS,
BUFFER_LENGTH, TIMEOUT, FLAGS, and
PACKET_LIST. The timeout and flags fields allow for MPI_MATCH
on the NIC and greatly increase the efficiency of
MPI. The timeout field is necessary since a mechanism may 
be needed for returning pinned pages if memory resources are 
constrained. 
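
As a hedged illustration of the descriptor just described, the nine named fields can be represented as a simple record, with the timeout used to decide when pinned pages should be returned. The field types, the interpretation of TIMEOUT as an absolute deadline in seconds, and the helper `split_expired` are assumptions made for this sketch; the text does not specify field widths or encodings.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ReceiveDescriptor:
    """The nine descriptor fields named in the text (types are assumed)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    buffer_address: int
    buffer_length: int
    timeout: float                  # assumed: absolute deadline, in seconds
    flags: int = 0
    packet_list: List[int] = field(default_factory=list)


def split_expired(
    descriptors: List[ReceiveDescriptor], now: float
) -> Tuple[List[ReceiveDescriptor], List[ReceiveDescriptor]]:
    """Return (live, expired); expired entries are candidates for unpinning."""
    live = [d for d in descriptors if d.timeout > now]
    expired = [d for d in descriptors if d.timeout <= now]
    return live, expired
```

A host under memory pressure could call `split_expired` periodically and unlock the pages behind each expired descriptor, which is the role the text assigns to the timeout field.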
All of the major cores required for implementing the
SPLINTERED UDP Offload Engine are summarized in the
FPGA table, along with the FPGA resources they require.
This table is a consolidation of both fitted code and consumption
per core specifications, for a total of 6 paths (60 Gbps
f-d):

[Table: Totals for each core within FPGA in FIG. 3 (surviving entry: 6747k)]

With continued reference to FIG. 3, some of the control
logic is also given in the system diagram. The control-path
CPU is accessed during PCIe-x16 cycles where the host is
coding-up the FPGA. The control-path CPU writes registers
and performs “code-up” within each of the FPGA’s devices in
conjunction with the DMA2 engine. The control-path CPU
performs reads and sends back the results via the DMA1
engine’s buffer, back to the host. For exemplary embodiments,
the control-path CPU can be an Altera Nios II embedded
“soft processor” which comes with its own library of
basic CPU command functions. The embedded “program
memory” is simply one of the FPGA resources, and is loaded
via the FPGA control logic, during power-up and initialization.
Other examples of a suitable CPU include any embedded
FPGA processor, or with external interface logic a microcontroller
or microprocessor can be used.

The offload engine calculates a TCP checksum, which is then
compared with the original checksum in the TCP header. If
the two values do not agree, then it is assumed that the packet
was transmitted in error and a request is made to have the
packet re-transmitted. The offload engine therefore “drops”
the packet, and the NIC does not send the flag for
“transaction complete” to user space. For an exemplary
implementation, a Verilog module was created for performing
the checksum calculations, and a bottleneck
analysis simulation was performed to determine the precise location for all
checksum components (data word addition, carry add, 1’s
complement, and appending checksum to packet stream).

FIG. 4 depicts an enlarged view of a portion of FIG. 3
showing a packet receive process and architecture in accordance
with an exemplary embodiment of the present disclosure.
More particularly, FIG. 4 shows a detailed view of buffer
304 and mux, demux, buffer 306 in FIG. 3. Architecture 400
can include packet FIFO buffer 402 as part of a pipeline, e.g.,
a 622 MHz pipeline as shown, though others can be implemented.
Control logic 404, e.g., suitable for an MD5 match, can
pass a packet through a demux process to an unlisted packet
buffer 406 and a listed packet buffer 407 connected to DMA
engine 408. DMA engine 408 can be connected to interface
410.

As can be discerned in FIG. 4, once a packet is written into 
either buffer, that specific buffer increments its write pointer 
to the next available packet space. Once a buffer has a com- 
plete packet, its ready flag signals the logic for the DMA1 
engine. The DMA1 engine clocks data out from the buffer at 
4 GHz. This can be accomplished by using the same buffering 
and clocking logic taken from the MD5 core. 
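
The checksum components named for the offload engine (data word addition, carry add, and 1’s complement) correspond to the standard 16-bit Internet checksum, which can be sketched in software as follows. This is an illustrative model only, not the Verilog module referred to in the text: the function names are hypothetical, and assembly of the TCP pseudo-header that a real TCP checksum also covers is omitted for brevity.

```python
def ones_complement_checksum(data: bytes) -> int:
    """16-bit Internet checksum over data (pseudo-header assembly omitted)."""
    if len(data) % 2:
        data += b"\x00"                            # pad to a whole 16-bit word
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]      # data word addition
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)   # carry add (end-around carry)
    return ~total & 0xFFFF                         # 1's complement


def accept_packet(segment_with_zeroed_checksum: bytes, received_checksum: int) -> bool:
    """Return False when the packet should be dropped and left for retransmission."""
    return ones_complement_checksum(segment_with_zeroed_checksum) == received_checksum
```

When `accept_packet` returns False, the behavior described in the text is to drop the packet and withhold the “transaction complete” flag from user space.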

FIG. 5 depicts a diagrammatic view of an alternate packet 
transmit process and architecture 500 in accordance with a
further embodiment of the present disclosure. Architecture
500 includes dual SPI-4.2 fully duplexed interfaces, as 
shown. 

FIG. 6 depicts a diagrammatic representation of a 40 Gbps
bandwidth embodiment 600A and a 60 Gbps bandwidth
embodiment 600B, in accordance with the present disclosure.
As shown, the 40 Gbps bandwidth embodiment 600A
can include two network interface cards and a Generation 1
PCI Express x16 backplane. The 60 Gbps bandwidth
embodiment can include a single-board offload engine running
in one slot through 60 Gbps f-d.

Using the 10 Gbps data rate, the present inventors determined
the number of bits that could be stored in 1 second; the
memory external to the FPGA can be selected by appropriate
scaling, as was done for an exemplary embodiment. For each
10 Gbps path, the present inventors determined that the offload
NIC would need 1.1 GByte of Double-Data Rate (DDR2)
RAM to adjust a packet rate from 10 Gbps down to 1
Gbps. The DDR2 SDRAM specifications for waveform timing,
latencies, and refresh cycles indicate that the DDR2
SDRAM can be used on the Altera S2GX PCIe dev kit utilized
for the present disclosure. Each development board used
was provided with four x16 devices:
device#MT47H32M16CC-3. 
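
The text does not spell out how the 1.1 GByte figure was derived. One reading that reproduces it, offered here as an assumption rather than as the inventors’ method, is that the buffer must absorb one second of the difference between a 10 Gbps arrival rate and a 1 Gbps drain rate. The function name `rate_adapt_buffer_bytes` is hypothetical.

```python
def rate_adapt_buffer_bytes(in_gbps: float, out_gbps: float, seconds: float) -> float:
    """Bytes of backlog accumulated when data arrives faster than it drains.

    Assumed model: backlog grows at (in - out) for the whole interval.
    """
    backlog_bits = (in_gbps - out_gbps) * 1e9 * seconds
    return backlog_bits / 8


# 10 Gbps in, 1 Gbps out, for the 1 second interval mentioned in the text.
required_bytes = rate_adapt_buffer_bytes(10, 1, 1.0)
```

Under this assumed model the backlog is 1.125 × 10^9 bytes, which is consistent with the stated requirement of about 1.1 GByte per 10 Gbps path.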

For verification purposes, the present inventors modeled 
the performance of an embodiment of FIG. 6. Accounting for 
the need for a refresh cycle, the throughput would be 700 kbit 
over a time of (70 usec+1 cycle delay), directly translating 
into 9.99 kb/usec (9.99 Gbps). For feasibility purposes, this 
bandwidth is seen as being practically the same data rate (no 
bottleneck) as 10 Gbps. Thus, for certain applications, the 
buffering internal to the FPGA can be sufficient and no exter- 
nal memory may be required on the NIC. 
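
The modeled throughput above can be reproduced with a small calculation. The duration of the “1 cycle delay” is not given in the text; the value of 0.07 microseconds used below is an assumption back-derived from the stated 9.99 Gbps result, and the function name `effective_gbps` is hypothetical.

```python
def effective_gbps(bits: float, burst_us: float, overhead_us: float) -> float:
    """Throughput when a burst of bits takes burst_us plus a fixed overhead."""
    bits_per_us = bits / (burst_us + overhead_us)
    return bits_per_us / 1000.0      # 1000 bits per microsecond equals 1 Gbps


# 700 kbit moved in 70 usec plus an assumed 0.07 usec cycle delay.
modeled = effective_gbps(700e3, 70.0, 0.07)

# The same burst with no refresh overhead, for comparison.
ideal = effective_gbps(700e3, 70.0, 0.0)
```

With these inputs the modeled rate rounds to 9.99 Gbps against an ideal 10 Gbps, which is the small gap the text treats as no practical bottleneck.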

Extensible Message Oriented Offload Model 

FIG. 7 depicts a diagrammatic view of an extensible mes- 
sage oriented offload model (“EMO”) 700 for a receive pro- 
cess, in accordance with an embodiment of the present dis- 
closure. The EMO model was used to verify/model the 
Splintered TCP throughput. The EMO model was verified by 
comparing the throughput of two machines using the Linux 
TCP stack and the modeled throughput. 

The EMO model 700 uses microbenchmarks combined to
determine latency and overhead for a protocol. FIG. 7
shows the EMO model for a receive. EMO allows us to
use information about the Splintered TCP NIC to estimate the
latency and throughput of Splintered TCP. Using EMO, we
can model the latency of traditional TCP as:
Latency=L_w+C_n/R_n+L_nh+C_h/R_h+L_ha 
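
The latency expression above can be written directly as a function. The text identifies L_nh as the DMA from the NIC to the host OS and C_n/R_n as the work performed in the NIC; reading L_w as wire latency, C_h/R_h as host cycles over host clock rate, and L_ha as host-to-application latency is an assumption made for this sketch. The host values of 200,000 cycles and 3 GHz are figures quoted elsewhere in this description; the remaining inputs are placeholders, not measurements.

```python
def emo_latency(l_w, c_n, r_n, l_nh, c_h, r_h, l_ha):
    """Latency = L_w + C_n/R_n + L_nh + C_h/R_h + L_ha (all times in seconds)."""
    return l_w + c_n / r_n + l_nh + c_h / r_h + l_ha


# Placeholder inputs: 1 usec on the wire, no NIC work, 2 usec NIC-to-host DMA,
# 200,000 host cycles at 3 GHz, and 1 usec from host to application.
example = emo_latency(
    l_w=1e-6, c_n=0, r_n=1.0, l_nh=2e-6, c_h=200_000, r_h=3e9, l_ha=1e-6
)
```

With these inputs the host term C_h/R_h (about 67 microseconds) dominates, which illustrates why moving host protocol work off the data path matters in this model.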

The EMO model was verified as being accurate by the use 
of two Pentium Pro Duo machines with Intel e1000 NICs in a
crossover configuration using Linux 2.6.22 operating system. 
Timings were added to both the kernel and the TCP client and 
TCP server test applications. The present inventors were 
unable (during the verification process) to (i) directly time the 
DMA from the NIC to the host OS (L_nh), and (ii) directly 
time the amount of work performed in the NIC (C_n/R_n). 
They did, however, get reasonable timings of the other 
microbenchmarks necessary to verify EMO. The EMO was 
observed to generally underestimate the latency by about
23%; however, the gain on the system was seen to be consistent.
The consistency is important as it shows that any caching
or scheduling randomness does not affect the latency at this 
level. 

Modeling Verification 

The present inventors modeled embodiments of Splintered
TCP using the above-described EMO. The latency of standard
TCP and of TCP using interrupt coalescing was initially modeled
using a Pentium Pro Duo with 1.86 GHz processors, but this
created an artificial limit in the speed of the PCI-Express bus
and the speed of the processor. Subsequently, the present
inventors assumed a machine with a 3 GHz processor and a
PCIe bus on the order of 100 Gbps f-d (our results have
essentially been limited by the PCIe bus bandwidth itself).
For this, the average number of cycles on the receive host
determined during EMO model verification (200,000) was
used, with the assumption that there was little or no time spent
on the traditional NIC. An interrupt latency (the limiting
factor) of 4 microseconds was assumed (which is the traditional
advertised interrupt latency for Intel Pentiums). The
limiting factor for Standard TCP is the interrupt latency (since
we assume multiple interrupts per message). The limiting
factor for TCP with Interrupt Coalescing is the context switch
latency of 7.5 microseconds. Splintered TCP has no context
switch or interrupt, so the limiting factor becomes the speed of
the PCI-Express bus.

Splintered TCP with protocol bypass was shown to provide
the performance necessary to provide per-flow bandwidth up
to 128+ Gbps. Accordingly, embodiments of the present disclosure
can provide a viable, inexpensive alternative for 100
Gbps networks using Ethernet. The number of connections
that can be served by a Splintered TCP NIC may depend on
the size (and therefore expense) of the NIC itself, as memory
will be the most costly addition to the Splintered TCP NIC.
Splintered TCP connections can, for some applications, be
brokered by an application library.

FIG. 8 depicts a method of packet splintering in accordance 
with exemplary embodiments. As shown in FIG. 8, at an
initial start stage 802 a packet can be processed by (in) splin- 
tering logic; starting stage 802 is shown linked to reference 
character 1 for the subsequent description of FIG. 10. 

With continued reference to FIG. 8, when splintering is
appropriate (e.g., the header is listed in a descriptor table), the
packet data can be transferred to an application layer (e.g.,
into a buffer or memory location/address), as described at
804. The packet header can be transferred to the operating
system, as described at 806. A descriptor table in hardware,
e.g., the NIC, can be updated to reflect receipt of the packet data, as
described at 808.

FIG. 9 depicts a packet processing method 900 in accor- 
dance with an embodiment of the present disclosure; starting 
stage 902 is shown linked to reference character 2 for the 
subsequent description of FIG. 10. A packet can be transferred
to an internet protocol layer, as described at 902. The
packet can be processed by the internet protocol layer, as 
described at 904. The packet can then be transferred to the 
transport layer, as described at 906. 

Continuing with the description of method 900, the packet
can be processed in the transport layer, as described at 908. 
The packet data can be transferred to an application layer, as 
described at 910. The data can then be processed in the appli- 
cation layer, as described at 912. 

FIG. 10 depicts a further embodiment of a packet processing
method 1000, in accordance with the present disclosure.
Method 1000 includes options for implementing procedures/
methods according to FIGS. 8-9, and 11, as will be
described.

For method 1000, an application can send descriptor/
header contents to an operating system, as described at 1002.
The operating system can perform a non-blocking socket read
from the application, as described at 1004. The operating
system can then attempt to pin a page (or pages) in host
memory, as described at 1006. If the pinning fails, the operating
system can perform a batch process of all headers (indicated
by “3”), as further shown and described for FIG. 11.

In response to a successful page pinning, the descriptor
can be processed in the operating system and put into a
descriptor table in the operating system, as described at 1010.
The operating system can then send (e.g., DMA) the descriptor
to a NIC (hardware) descriptor table, as described at
1012. The received packet (data) can be input to the NIC
physical layer, as described at 1014. The packet can be processed
in the physical layer, as described at 1016. The packet
can be transferred from the physical layer to a NIC data link
layer, as described at 1018, for processing, as described at
1020.

Continuing with the description of method 1000, a query 
can be performed to see if the packet is listed in the descriptor 
table, as described at 1022. If the packet is not listed in the 
descriptor table, normal processing of the packet can occur 
(indicated by “2”), e.g., as previously described for method 
900. If, on the other hand, the packet is listed in the descriptor 
table, the packet can then be transferred to splintering logic 
(indicated by “1”), e.g., as previously described for method 
800. 

FIG. 11 depicts a further embodiment of a packet process- 
ing method 1100, in accordance with the present disclosure. 
Method 1100 can be useful in the case where a pinning 
attempt fails, e.g., for an unsuccessful outcome at 1008 of 
method 1000. 

For method 1100, in response to an unsuccessful pinning
attempt, the operating system can perform a batch process of
all headers, e.g., those in a headers-to-be-processed ring, as
described at 1102. The associated application can negotiate
for memory, as described at 1104. The application can perform
a de-queue-receive system call, as described at 1106.
The operating system can remove the descriptor from the
descriptor table in the operating system, as described at 1108.
The operating system can re-queue the descriptor onto the 
NIC with a flag being set, as described at 1110. The NIC can 
then remove the descriptor from the NIC descriptor table, as 
described at 1112. 
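
The descriptor teardown at 1108 through 1112 can be sketched with two dictionaries standing in for the operating system and NIC descriptor tables. The function name `dequeue_receive`, the dictionary representation, and the `remove` flag name are assumptions made for illustration; the text states only that the descriptor is re-queued onto the NIC “with a flag being set.”

```python
def dequeue_receive(os_table: dict, nic_table: dict, key: str) -> bool:
    """Model of steps 1108-1112 for one descriptor; returns False if unknown."""
    descriptor = os_table.pop(key, None)          # 1108: OS removes its entry
    if descriptor is None:
        return False
    requeued = dict(descriptor, remove=True)      # 1110: re-queue with flag set
    if requeued["remove"]:                        # 1112: NIC acts on the flag
        nic_table.pop(key, None)
    return True
```

Keeping the two removals in this order means the NIC never holds a descriptor that the operating system has already forgotten for longer than one re-queue step.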

FIG. 12 depicts a processing method 1200 for an operating 
system to process headers, in accordance with an embodi- 
ment of the present disclosure. A check can be made for the 
occurrence of an interrupt, e.g., a quantum expiry or other 
interrupts, as described at 1202. Upon the occurrence of such 
an interrupt, an operating system can batch process all head- 
ers stored, e.g., in a headers-to-be-processed ring/buffer, as 
described at 1204. A determination can be made as to whether 
a header is associated with a TCP packet, as described at 
1206. 

Continuing with the description of method 1200, in
response to a determination that the header is associated with
a TCP packet, an acknowledgment can be created and sent (e.g.,
by a DMA process) to a transmit ring on the NIC, as described
at 1208. Then (or after a negative determination at 1206) the
operating system can update its descriptor table, as described
at 1210. It should be understood that, except for 1206, all
other instances of “TCP” as used herein are applicable to 
UDP. 
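
A minimal software sketch of the batch header processing of method 1200 follows. The header representation (a dictionary with `proto`, `flow`, `seq`, and `length` keys), the acknowledgment format, and the per-flow byte counter kept in the operating system table are all hypothetical choices made for the example; only the control flow (drain the ring, acknowledge TCP headers, update the table) follows the text.

```python
from collections import deque


def batch_process_headers(header_ring: deque, transmit_ring: deque, os_table: dict) -> int:
    """Drain the headers-to-be-processed ring, as would be done on an interrupt (1204)."""
    processed = 0
    while header_ring:
        hdr = header_ring.popleft()
        if hdr["proto"] == "TCP":                              # decision at 1206
            ack = {"flow": hdr["flow"], "ack": hdr["seq"] + hdr["length"]}
            transmit_ring.append(ack)                          # 1208: queue the ACK
        entry = os_table.setdefault(hdr["flow"], {"bytes_received": 0})
        entry["bytes_received"] += hdr["length"]               # 1210: update table
        processed += 1
    return processed
```

Because the payload has already been placed in application memory, this loop touches only headers, which is what allows it to be deferred until the operating system is interrupted for other work.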

FIG. 13 depicts a method 1300 of transmitting processed 
data after splintered logic processing has occurred, in accor- 
dance with embodiments of the present disclosure. In method 
1300, data that is to be transmitted can be input into an 
application layer, e.g., data that has been “splintered” off of a 
packet by method 800 of FIG. 8, as described at 1302. The 
data can then be processed in the application layer, as 
described at 1304. The data can be transferred to a transport 
layer, as described at 1306, for processing, as described at 
1308.

Continuing with the description of method 1300, the data
can be transferred to an internet protocol (“IP”) layer, as
described at 1310, for processing in the IP layer, as described
at 1312. The data can be transferred to a data link, as described
at 1314, and processed in the data link, as described at 1316.
The data can then be transferred to a physical layer, as
described at 1318, and processed in the physical layer, as
described at 1320. The data can then be transferred to a
network, as described at 1322.

Testing

The control logic, registers, decoding, and internal selects
for each device have been shown in the previous figure by a
single box, “ctrl logic.” During the proof-of-concept testing,
the present inventor(s) used an in-house library of control functions
and derived an approximate amount of logic (Verilog
equations) for this unit. Off-the-shelf Verilog code was available
for the Atlantic Interface and control logic. Using the
Altera Quartus II FPGA tools, the present inventor(s) synthesized
and fit the logic into an Altera GX130 FPGA, consuming
only 12-20% of FPGA on-chip resources.

For the completed testing, Verilog coding and test bench simulation
were directed towards functions with either critical logic or potential
bottlenecks, in order to prove that the data path could feasibly
support rates of n×10 Gbps. ModelSim was used to simulate
the data flow between the packet FIFO, demux logic, listed packet
buffer, and buffer to DMA. The results of the simulation were
that data flow was functional as given in the previous diagrams,
and it was verified that there were no bandwidth bottlenecks:
the system design was proven to be feasible.

Accordingly, embodiments of the present disclosure can
provide various advantages over the prior art; such advantages
can include the ability to increase file transfer rates over
networks and/or provide file transfer functionality with
reduced cost. As faster versions of the busses become available,
e.g., PCI Express bus, embodiments of the present disclosure
can provide splintered TCP and UDP operation at
higher rates, e.g., 128 Gbps to 1,000+ Gbps f-d for Terabit
Ethernet applications.

While certain embodiments have been described herein, it 
will be understood by one skilled in the art that the methods,
systems, and apparatus of the present disclosure may be 
embodied in other specific forms without departing from the 
spirit thereof. 

Accordingly, the embodiments described herein, and as 
claimed in the attached claims, are to be considered in all
respects as illustrative of the present disclosure and not 
restrictive. 

What is claimed is: 

1. A method of network acceleration comprising:

instructing an application to send descriptor contents to an
operating system;

instructing the operating system to perform a non-blocking
socket read from the application;

instructing the operating system to attempt to pin a page in
host memory;

in response to a successful passing of the page pinning,
instructing the operating system to process the descriptor;

instructing the operating system to put the descriptor into
its descriptor table;

instructing the operating system to send via dynamic
memory access (DMA) the processed descriptor to a
network interface device descriptor table;

instructing that a received (RX) packet is input to a network
interface device physical layer;

instructing that the packet is processed in the network
interface device physical layer;

instructing that the packet is transferred from the network
interface device physical layer to a network interface
device data link layer;

instructing that the packet is processed in the network
interface device data link layer;

instructing that a query is made to see if the packet is listed
in the descriptor table;

in response to the packet being listed in the descriptor table,
transferring the packet to splintering logic;

in response to an unsuccessful passing of the page pinning,
instructing the operating system to process all headers in
a headers-to-be-processed ring;

instructing the application to negotiate for memory; and

instructing the application to perform de-queuing and
receive a system call.

2. The method of claim 1, further comprising: 
instructing the operating system to remove the packet
descriptor from the operating system descriptor table;
instructing the operating system to re-queue the packet 
descriptor onto the network interface device with a flag 
set; and 

updating the network interface device hardware descriptor 
table so that the network interface device removes the 
descriptor from the network interface device descriptor 
table. 

3. The method of claim 1, wherein the network interface 
device comprises a network interface card (NIC) circuit. 

4. The method of claim 3, wherein the NIC circuit is con- 
figured and arranged to have a bandwidth of about 5 Gbps to 
about 1,000 Gbps. 

5. The method of claim 4, wherein the NIC circuit is con- 
figured and arranged to have a bandwidth of about 10 Gbps. 

6. The method of claim 1, further comprising updating a 
descriptor table in the network interface device. 

7. A method of processing a packet with splintering logic, 
the method comprising: 

providing a network interface circuit with a packet having 
a packet header descriptor and packet data; 
transferring packet data to a storage location in an appli- 
cation layer linked to the network interface circuit, 
wherein the storage location is a receive buffer utilized 
by the application layer to receive data; 
transferring a packet header to an operating system linked 
to the network interface circuit and the application layer, 
and wherein use of a transmission control protocol/internet
protocol (TCP/IP) stack or a UDP/IP stack is
avoided for the transferring of the packet header;
providing instructions to an application to send descriptor 
contents to the operating system; 
providing instructions to the operating system to perform a 
non-blocking socket read from the application; 
providing instructions to the operating system to attempt to 
pin a page in host memory; 

in response to a successful passing of the page pinning, 
providing instructions to the operating system to process 
the descriptor; 

providing instructions to the operating system to put the 
descriptor into its descriptor table; 
providing instructions to the operating system to send via 
dynamic memory access (DMA) the processed descrip- 
tor to a network interface device descriptor table; 
providing instructions that a received (RX) packet is input 
to a network interface device physical layer; 
providing instructions to the network interface device 
physical layer for processing the packet;
providing instructions to the network interface device
physical layer for transferring the packet from the net- 
work interface device physical layer to a network inter- 
face device data link layer; 

providing instructions to the network interface device data
link layer to process the packet;
providing instructions to make a query to see if the packet 
is listed in the descriptor table; 
in response to the packet being listed in the descriptor table, 
transferring the packet to splintering logic;

in response to an unsuccessful passing of the page pinning, 
providing instructions to the operating system to process 
all headers in a headers-to-be-processed ring; 
providing instructions to the application to negotiate for 
memory; and

providing instructions to the application to perform de- 
queuing and receive a system call. 

8. The method of claim 7, wherein the packet is a transmis- 
sion control protocol (TCP) packet. 

9. The method of claim 7, wherein the packet is a user
datagram protocol (UDP) packet.

10. The method of claim 7, wherein the packet is an internet
protocol (IP) packet. 

11. The method of claim 7, wherein the network interface 
circuit comprises a network interface card (NIC) circuit.

12. The method of claim 11, wherein the NIC circuit is 
configured and arranged to have a bandwidth of about 5 Gbps 
to about 1,000 Gbps. 

13. The method of claim 12, wherein the NIC circuit is
configured and arranged to have a bandwidth of about 10
Gbps.

14. The method of claim 7, further comprising updating a 
descriptor table in the network interface circuit. 

15. A computer-executable program product comprising a
computer-readable non-transitory storage medium with resident
computer-readable instructions, the computer readable
instructions comprising:

instructions for providing a network interface device with 
a packet having a packet header and packet data; 

instructions for transferring packet data to a buffer in an
application layer linked to the network interface device; 
instructions for transferring a packet header to an operating 
system (OS) linked to the network interface device and 
the application layer, wherein use of a TCP/IP stack or a 
UDP/IP stack is avoided for the transferring of the
packet header; 

instructions for an application to send descriptor contents 
to the operating system; 

instructions for the operating system to perform a
non-blocking socket read from the application;

instructions for the operating system to attempt to pin a 
page in host memory; 

instructions for, in response to a successful passing of the 
page pinning, the operating system to process the 
descriptor;

instructions for the operating system to put the descriptor 
into its descriptor table; 

instructions for the operating system to send via dynamic 
memory access (DMA) the processed descriptor to a 
network interface device descriptor table;

instructions that a received (RX) packet is input to a net- 
work interface device physical layer; 
instructions that the packet is processed in the network 
interface device physical layer; 

instructions that the packet is transferred from the network
interface device physical layer to a network interface 
device data link layer; 





instructions that the packet is processed in the network 
interface device data link layer; 
instructions that a query is made to see if the packet is listed 
in the descriptor table; 

instructions for, in response to the packet being listed in the
descriptor table, transferring the packet to splintering 
logic; 

instructions for, in response to an unsuccessful passing of
the page pinning, the operating system to process all
headers in a headers-to-be-processed ring;

instructions for the application to negotiate for memory; 
and 

instructions for the application to perform de-queuing and 
receive a system call. 

16. The program product of claim 15, wherein the packet is
a transmission control protocol (TCP) packet. 

17. The program product of claim 15, wherein the packet is 
a user datagram protocol (UDP) packet. 

18. The program product of claim 15, wherein the packet is
an internet protocol (IP) packet.

19. The program product of claim 15, further comprising 
instructions for processing packet headers upon the occur- 
rence of an OS interrupt. 
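
Claim 19 recites processing packet headers when an OS interrupt occurs. A minimal sketch of that batch step follows, assuming the headers wait in a FIFO ring; `drain_header_ring` and `process_header` are hypothetical names, and the real protocol processing each header would receive is left to the caller.

```python
from collections import deque


def drain_header_ring(ring, process_header):
    """Process every queued header in one batch; return how many were handled.

    Intended to be called once per OS interrupt rather than once per packet.
    """
    handled = 0
    while ring:
        process_header(ring.popleft())  # oldest header first
        handled += 1
    return handled
```

For example, calling `drain_header_ring(ring, seen.append)` on a ring holding three headers empties the ring and appends the three headers to `seen` in arrival order.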

20. The program product of claim 15, wherein the network 
interface device comprises a network interface card (NIC)
circuit. 

21. The program product of claim 20, wherein the NIC 
circuit is configured and arranged to have a bandwidth of 
about 5 Gbps to about 1,000 Gbps. 

22. The program product of claim 21, wherein the NIC
circuit is configured and arranged to have a bandwidth of 
about 10 Gbps. 

23. The program product of claim 15, further comprising 
updating a descriptor table in the network interface device. 

24. A splintered packet offload engine system comprising:
a network interface device configured to interface with (i)
a network, (ii) an operating system, and (iii) an applica-
tion, wherein the network interface device includes a 
descriptor table, the operating system is linked with host 
memory and configured to perform a page pinning to the
memory, the application includes a receive buffer, and 
the network interface device comprises splinter offload 
logic, wherein the splinter offload logic is configured to 
avoid use of a TCP/IP stack or a UDP/IP stack for processing
headers;

wherein the splintered packet offload engine system is 
configured to: 

instruct the application to send descriptor contents to the 
operating system; 

instruct the operating system to perform a non-blocking
socket read from the application; 
instruct the operating system to attempt to pin a page in 
host memory; 

in response to a successful passing of the page pinning, 
instruct the operating system to process the descriptor;

instruct the operating system to put the descriptor into its 
descriptor table; 

instruct the operating system to send via dynamic 
memory access (DMA) the processed descriptor to a
network interface device descriptor table; 


instruct that a received (RX) packet is input to a network 
interface device physical layer; 
instruct that the packet is processed in the network inter- 
face device physical layer; 

instruct that the packet is transferred from the network 
interface device physical layer to a network interface 
device data link layer; 

instruct that the packet is processed in the network inter- 
face device data link layer; 

instruct that a query is made to see if the packet is listed 
in the descriptor table; 

in response to the packet being listed in the descriptor 
table, transfer the packet to splintering logic; 
in response to an unsuccessful passing of the page pin- 
ning, instruct the operating system to process all head- 
ers in a headers-to-be-processed ring; 
instruct the application to negotiate for memory; and 
instruct the application to perform de-queuing and 
receive a system call. 
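
The host-side branch on page pinning recited in claim 24 can be sketched as follows. This is an illustration under stated assumptions, not the claimed system: `post_descriptor`, `try_pin`, and the dictionary-based tables are hypothetical, and copying the descriptor into `nic_table` merely stands in for the DMA transfer to the network interface device descriptor table.

```python
from collections import deque


def post_descriptor(descriptor, try_pin, os_table, nic_table,
                    header_ring, process_header):
    """Post a receive descriptor, or fall back to draining pending headers.

    descriptor: dict with hypothetical keys "flow" and "page".
    try_pin:    callable returning True when the page could be pinned.
    """
    if try_pin(descriptor["page"]):
        # Pinning succeeded: record the descriptor on the host, then mirror it
        # to the NIC table (a plain copy stands in for the DMA transfer).
        os_table[descriptor["flow"]] = descriptor
        nic_table[descriptor["flow"]] = dict(descriptor)
        return "posted"
    # Pinning failed: process all headers in the headers-to-be-processed ring.
    while header_ring:
        process_header(header_ring.popleft())
    return "drained"
```

The return value is only a convenience for the caller in this sketch; the claim itself does not recite one.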

25. The system of claim 24, further comprising a software 
application. 

26. The system of claim 24, further comprising a media 
access controller for Ethernet. 

27. The system of claim 24, further comprising a backplane 
interface. 

28. The system of claim 24, wherein the splinter offload 
logic is configured and arranged in a field programmable gate 
array (FPGA). 

29. The system of claim 24, wherein the splinter offload 
logic is configured and arranged in an application specific 
integrated circuit (ASIC). 

30. The system of claim 24, wherein the splinter offload 
logic is configured and arranged in a hardware description or 
behavioral language. 

31. The system of claim 30, wherein the language is C, 
Verilog, or VHSIC hardware description language (VHDL), 
wherein VHSIC refers to very-high-speed integrated circuits.

32. The system of claim 24, wherein the splinter offload 
logic is configured and arranged in a circuit board. 

33. The system of claim 24, wherein the network interface 
device is configured and arranged to have a bandwidth of 
about 5 Gbps to about 1,000 Gbps. 

34. The system of claim 33, wherein the network interface 
device is configured and arranged to have a bandwidth of 
about 10 Gbps. 

35. The system of claim 24, wherein the network interface 
device is configured and arranged to receive a user datagram 
protocol (UDP) packet. 

36. The system of claim 24, wherein the network interface 
device is configured and arranged to receive a UDP-based 
data transfer protocol (UDT) packet. 

37. The system of claim 24, wherein the network interface 
device is configured and arranged to receive an internet pro- 
tocol (IP) packet. 

38. The system of claim 24, wherein the network interface 
device is configured and arranged to receive a transmission 
control protocol (TCP) packet. 





